
# Project Datasets and Artifacts

This repository contains datasets, project artifacts, and resources for three software projects:

- `AuthoringToolKit`
- `CompreFace`
- `DynamicMailboxes`

Each project folder contains anonymized datasets derived from real development workflows, as well as supporting materials used for automated reasoning, test generation, and patch analysis.

---

## Folder Structure

Each project directory contains the following files:

### 1. `dataset_[project]_anonymized.csv`

This CSV file includes anonymized ticket data with the following columns:

| Column | Description |
|--------|-------------|
| `ticket` | Unique identifier for the issue. |
| `summary` | Short description of the issue. |
| `description` | Detailed description of the issue. |
| `issue_type` or `label` | Type of issue (e.g., bug, feature). |
| `priority` | Priority level of the issue. |
| `resolution` | Resolution status (e.g., fixed, won't fix). |
| `steps_to_verify_mandatory` | Steps required to verify the issue has been resolved. |
| `steps_to_reproduce_migrated_2` | Steps to reproduce the issue (migrated from original sources). |
| `image_descriptions` | Descriptions of attached images, generated by AI. |
| `full_ticket_description` | Combined summary, description, steps to reproduce, and image descriptions. |
| `before_commit` | Git commit hash before the fix was applied. |
| `merge_commit` | Git commit hash where the fix was merged. |
| `source` | Source system or tool of the ticket. |
| `pos_tests` | List of tests added between `before_commit` and `merge_commit` (used for TDD assessment). |
| `skipped_tests` | Tests that were skipped or that failed at both `before_commit` and `merge_commit`. |
| `interface_mismatch` | Boolean indicating whether the issue might introduce interface mismatches. |
| `full_ticket_description_with_tests` | Combination of `full_ticket_description` and the `pos_tests` content. |

> ⚠ Note: Some columns may be empty depending on ticket content, but all will follow the above structure.
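
As a quick sanity check on the column layout above, a CSV file can be read with Python's standard `csv` module. This is a minimal sketch, not part of the repository's tooling: the function name `load_tickets` is hypothetical, and it assumes the issue-type column appears under either the name `issue_type` or the name `label`.

```python
import csv
import io

# Column names copied from the table above. The issue-type column is
# handled separately because it is assumed to be named either
# "issue_type" or "label" depending on the project.
EXPECTED_COLUMNS = {
    "ticket", "summary", "description", "priority", "resolution",
    "steps_to_verify_mandatory", "steps_to_reproduce_migrated_2",
    "image_descriptions", "full_ticket_description",
    "before_commit", "merge_commit", "source",
    "pos_tests", "skipped_tests", "interface_mismatch",
    "full_ticket_description_with_tests",
}


def load_tickets(handle):
    """Read ticket rows from an open CSV file and list any missing columns."""
    reader = csv.DictReader(handle)
    header = set(reader.fieldnames or [])
    missing = EXPECTED_COLUMNS - header
    if not header & {"issue_type", "label"}:
        missing.add("issue_type or label")
    rows = list(reader)  # each row is a dict keyed by column name
    return rows, sorted(missing)
```

Pass it a file opened with `open(path, newline="", encoding="utf-8")`, where `path` points at one of the `dataset_[project]_anonymized.csv` files. Empty cells come back as empty strings, which matches the note that some columns may be empty.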

---

### 2. `dataset_[project]_anonymized.jsonl`

This file is a `.jsonl` (JSON Lines) version of the dataset that closely follows the format used by **mSWEAgent**, excluding the test and patch fields. Each line represents an issue with the following structure:

```json
{
  "org": "<organization>",
  "repo": "<repository>",
  "number": <issue_number>,
  "state": "open" | "closed",
  "title": "<issue title>",
  "body": "<full ticket body>",
  "base": {
    "label": "<branch>",
    "ref": "<ref>",
    "sha": "<commit sha>"
  },
  "resolved_issues": [
    {
      "number": <issue_number>,
      "title": "<title>",
      "body": "<body>"
    }
  ],
  "instance_id": "<unique_id>"
}
```

- `org`: GitHub organization name
- `repo`: Repository name
- `number`: Issue number
- `state`: Issue state (`open` or `closed`)
- `title`: Short issue summary
- `body`: Full issue description including code samples and metadata
- `base`: Reference branch and commit SHA
- `resolved_issues`: Mirror of issue metadata, included for downstream processing
- `instance_id`: Unique identifier in the format `<org>__<repo>_<number>`
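
The structure above can be read one record per line with the standard `json` module. The sketch below is illustrative only: the helper names `read_instances` and `expected_instance_id` are hypothetical, and the ID check relies solely on the `<org>__<repo>_<number>` format stated above.

```python
import json


def read_instances(lines):
    """Yield one dict per non-empty JSON Lines record."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines, e.g. a trailing newline
            yield json.loads(line)


def expected_instance_id(record):
    """Rebuild the ID as <org>__<repo>_<number> from the record's own fields."""
    return f"{record['org']}__{record['repo']}_{record['number']}"
```

An open file handle can be passed directly as `lines`, since iterating over a text file yields one line at a time. Comparing `expected_instance_id(record)` with `record["instance_id"]` is a cheap way to spot malformed records.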

---

### 3. `[project].zip`

Zipped copy of the full GitHub project, including full history and codebase. This is used for offline processing, patch validation, and TDD analysis.

---

### 4. `[project]-mock.zip`

Stripped-down version of the original project, containing minimal viable code structure and tests. Useful for benchmarking agent performance on simplified pipelines.

---

### 5. `patches_neg-[project].zip`

Negative patches that, when applied to the `merge_commit`, should restore the state of the `base` commit.

These patches are intentionally incorrect: each one undoes the fix. They are needed by the ticket and patch checker. Applying them to the `merge` version should reverse the changes, effectively producing the `base` version:

```
merge + apply(patches_neg) = base
```

---

### 6. `patches_pos-[project].zip`

Positive patches that, when applied to the `base` commit, should result in the `merge_commit`.

These patches represent **ground truth fixes**. Applying them to the `base` (`before_commit`) version should reproduce the `merge_commit` version:

```
base + apply(patches_pos) = merge
```
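
Both patch identities (positive patches on `base`, negative patches on `merge`) come down to checking out a commit and applying a patch file. The following is a rough sketch under assumptions: it presumes the patches are unified diffs that `git apply` accepts, that the project archive has been extracted into a Git working tree, and that `patch_path` is an absolute path. The function names are hypothetical and the layout of the patch archives is not specified here.

```python
import subprocess


def build_apply_commands(commit, patch_path):
    """Return the git commands that check out `commit` and apply one patch."""
    return [
        ["git", "checkout", "--detach", commit],
        # Dry run first so a patch that does not apply fails before
        # the working tree is modified.
        ["git", "apply", "--check", patch_path],
        ["git", "apply", patch_path],
    ]


def apply_patch(repo_dir, commit, patch_path):
    """Run the commands inside `repo_dir`; raises CalledProcessError on failure."""
    for cmd in build_apply_commands(commit, patch_path):
        subprocess.run(cmd, cwd=repo_dir, check=True)
```

For a positive patch, `commit` would be the ticket's `before_commit`; for a negative patch, it would be the `merge_commit`.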

---

## Usage

These datasets and artifacts are designed for:

- **TDD Process Evaluation**
- **Agent Reasoning Benchmarks**

They enable realistic simulations of software engineering workflows and support research in AI-assisted programming.

---

## License & Attribution


---

## Contact

For questions, contributions, or feedback, please contact the maintainers.

---
